Artificial Intelligence in the Life Sciences — Latest Matching Preprints

1

A transformer model explaining mechanisms of drug therapeutic and adverse effects

Ke, J.; Melamed, R. D.

2026-05-13 genetic and genomic medicine 10.64898/2026.05.11.26352917 medRxiv

Top 0.1%

4.3%

Understanding which disease genes are altered by a drug can provide insight into the biology of its effect, help us understand adverse drug effects, and suggest new drug uses. Here, we build on our model Draphnet, which was designed to explain drug therapeutic and side effects by learning a network connecting drugs to the disease genes they alter. Our new model, DraPhormer, has a similar goal, but instead of relying on a linear model, it learns drug-to-gene connections with a transformer model. DraPhormer integrates drug molecular data, disease genetics, and known drug effects on diseases, along with language models representing all of these entities. We show in simulations that DraPhormer can explain the genetic mechanisms of drug effects. Then, we present our design for incorporating drug and disease biology into the model. Finally, we benchmark the model's ability to learn drug indications and side effects in real data.

2

Revisiting CPUs for Protein Folding: Xeon-Based Acceleration of AlphaFold2

Chaudhary, N.; Yang, W.; Kalamkar, D.; Zhou, J.; Ghosh, S.; Xia, L.; Tiwari, M.; Heinecke, A.; Kaul, B.; Misra, S.

2026-05-29 genomics 10.64898/2026.05.27.728222 medRxiv

Top 0.1%

2.8%

Protein structure prediction via AlphaFold2 has revolutionized drug discovery, yet its end-to-end execution remains computationally intensive. While GPUs are traditionally favored for deep learning, the AlphaFold2 algorithm consists of heterogeneous phases -- preprocessing with sparse database searches and model inference with low-arithmetic-intensity attention modules -- that present unique architectural challenges. In this work, we address these bottlenecks by introducing Open-Omics-AlphaFold2, a highly optimized implementation for Intel(R) Xeon(R) CPUs. By leveraging the CPU's versatility in handling both sparse preprocessing algorithms and dense matrix operations via Intel Advanced Matrix Extensions (AMX), we accelerate the entire pipeline end-to-end. Our optimization strategy employs multi-level parallelism -- spanning multiprocessing, multi-threading, and vectorization -- alongside cache-aware tiling and operator fusion. Our results demonstrate that, on a Xeon CPU, Open-Omics-AlphaFold2 achieves a 2-7.58x speedup for preprocessing and a 19.8-29.2x speedup for model inference over baseline DeepMind-AlphaFold2. Moreover, for a proteome of 391 proteins, Open-Omics-AlphaFold2 running on a dual-socket Intel Xeon 6980P system achieves 76% higher throughput than the state-of-the-art GPU-accelerated solution, FastFold, running on a single-socket Intel Xeon 6980P CPU with NVIDIA H100 offload. Code availability: Bare metal: https://github.com/IntelLabs/open-omics-alphafold Containerized: https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/tree/main/pipelines/alphafold2-based-protein-folding
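
To make the idea of cache-aware tiling concrete, here is a minimal NumPy sketch of a blocked matrix multiply, in which the product is accumulated one tile at a time so that each working set can stay in cache. It is a generic illustration only: the function name `tiled_matmul` and the tile size are hypothetical and are not taken from Open-Omics-AlphaFold2, whose kernels are not described in the abstract.

```python
import numpy as np

def tiled_matmul(a, b, tile=64):
    """Blocked matrix multiply: computes a @ b one (tile x tile) block at a time.

    Hypothetical helper for illustration; `tile` is an assumed parameter that a
    real implementation would tune to the cache size of the target CPU.
    """
    m, k = a.shape
    k2, n = b.shape
    if k != k2:
        raise ValueError("inner dimensions must match")
    c = np.zeros((m, n), dtype=np.result_type(a, b))
    for i in range(0, m, tile):          # rows of the output tile
        for j in range(0, n, tile):      # columns of the output tile
            for p in range(0, k, tile):  # accumulate over the shared dimension
                # NumPy slices are clipped at the array edge, so ragged
                # border tiles need no special handling.
                c[i:i + tile, j:j + tile] += (
                    a[i:i + tile, p:p + tile] @ b[p:p + tile, j:j + tile]
                )
    return c

rng = np.random.default_rng(0)
A = rng.standard_normal((130, 70))
B = rng.standard_normal((70, 95))
C = tiled_matmul(A, B, tile=32)
```

In Python the loop overhead outweighs any cache benefit; the sketch shows the access pattern, not the performance.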

3

Deep learning models for chemical perturbation prediction do not yet utilise drug molecular features

Bai, J.; Prince, S.; Nitschke, G. S.

2026-05-15 bioinformatics 10.64898/2026.05.13.724458 medRxiv

Top 0.1%

2.5%

Recent deep learning models for L1000 chemical perturbation prediction incorporate dedicated drug molecular encoders. We retrained seven such models from scratch with zeroed or shuffled drug inputs, and compared them with a multilayer perceptron that uses only cell-line basal expression. Under drug-blind evaluation, ablation caused negligible performance changes and the drug-free baseline matched all models. Current architectures do not yet utilise drug molecular features for generalisation to unseen compounds.
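
The ablation described above can be sketched as follows. This is an assumed, minimal version: the abstract says drug inputs were "zeroed or shuffled" but does not specify how, so shuffling rows across samples is an assumption, and `ablate_drug_features` is a hypothetical helper name.

```python
import numpy as np

def ablate_drug_features(drug_x, mode, rng=None):
    """Return a copy of a (samples x features) drug matrix with its signal removed.

    mode="zero"    -> every drug feature replaced by 0
    mode="shuffle" -> rows permuted, so each sample gets another sample's drug
                      features (assumed interpretation of "shuffled")
    """
    if mode == "zero":
        return np.zeros_like(drug_x)
    if mode == "shuffle":
        if rng is None:
            rng = np.random.default_rng()
        return drug_x[rng.permutation(len(drug_x))]
    raise ValueError(f"unknown ablation mode: {mode!r}")

drug_x = np.arange(12.0).reshape(4, 3)   # toy drug features, 4 samples
zeroed = ablate_drug_features(drug_x, "zero")
shuffled = ablate_drug_features(drug_x, "shuffle", np.random.default_rng(0))
```

A model retrained on `zeroed` or `shuffled` inputs that scores as well as one trained on `drug_x` is not extracting usable information from the drug encoder.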

4

Desktop-Scale Hit-Point Discovery for Intrinsically Disordered α-Synuclein Using State-Space Compression and a Discrete Phase-Interference Search Operator

Kim, D. H.; Khenmedekh, G.-O.; Park, i.; Kim, S.

2026-06-28 bioinformatics 10.64898/2026.06.22.733879 medRxiv

Top 0.1%

2.4%

The accessible chemical space dwarfs any tractable screening budget, and most artificial intelligence drug discovery pipelines respond by docking and ranking a small sublibrary. The resulting hit list is agnostic to selectivity, brain penetration, toxicity, synthetic accessibility, and chemical novelty. We present ISTP-DPISO DrugEngine, an end-to-end engine developed by ISTP Tech that integrates the Local Information Criticality Principle (LICP) with a Discrete Phase-Interference Search Operator (DPISO). We demonstrate the engine on the intrinsically disordered protein (IDP) α-synuclein, whose non-amyloid-component (NAC, residues 61-95) drives Parkinson-associated aggregation. The resulting LICP active set focuses the expensive LICP-DPISO scoring: in a production-scale run, the engine compressed a ~8.46x10^8-molecule mirror to a 10,000,000-molecule active set (~85-fold) before scoring, then converged to a compact, safety-gated shortlist plus de novo designs. The entire campaign ran on a single desktop workstation, without any high-performance-computing cluster. Three engine-prioritized, commercially available candidates (2-D08, Uralenol, Herbacetin) and an (-)-epigallocatechin gallate (EGCG) positive control were then tested in a thioflavin-T (ThT) aggregation assay at 100 µM: all three engine-nominated candidates suppressed α-synuclein aggregation, giving perfect prospective inhibitor-call concordance (3/3 nominated); together with the EGCG positive control, all four assayed compounds inhibited aggregation (4/4 total), two by ≤80% plateau reduction. ISTP-DPISO DrugEngine reframes virtual screening from post-hoc score fusion to a single, state-space-compressed, safety-gated, experimentally validated discovery pipeline.

5

CardioSafe: Multi-task prediction of cardiac ion channel activity with reverse-leak audited benchmarking

Jovanovic, M.; Weidener, L. S.; Brkic, M.; Ulgac, E.; Meduri, A.

2026-05-12 bioinformatics 10.64898/2026.05.06.723181 medRxiv

Top 0.1%

2.2%

Drug-induced inhibition of the hERG potassium channel is the leading cause of cardiac safety-related drug attrition, but the Comprehensive in Vitro Proarrhythmia Assay (CiPA) framework requires activity data on multiple cardiac ion channels to assess proarrhythmic risk. We present CardioSafe, a three-branch multi-task neural network with cross-attention fusion that integrates chemical fingerprints, ChemBERTa embeddings, and predicted L1000 transcriptomic features to predict blocker status and potency for hERG, Nav1.5, and Cav1.2, with an exploratory IKs head. CardioSafe was trained on the largest publicly reported multi-channel cardiac ion channel dataset, combining ChEMBL 36 with the hERGCentral database (331127 hERG, 3160 Nav1.5, 1138 Cav1.2, and 115 IKs compounds), curated under a pharmacology-aware policy that retains censored measurements and inhibition-percentage votes. Under Tanimoto-similarity-controlled splits, CardioSafe outperforms the leading published comparators (CToxPred2 and CardioGenAI) on the data-rich hERG head; on the smaller Nav1.5 and Cav1.2 heads the standard evaluation is statistically inconclusive. A reverse-leak audit revealed that 22% of Nav1.5 and 21% of Cav1.2 test compounds were present in published comparators' training data (92% as exact compound matches); after removing these contaminated compounds, CardioSafe's lead on Nav1.5 and Cav1.2 also reaches statistical significance, demonstrating that prior cross-publication benchmarks for these channels were inflated by training-data overlap. Scientific contribution: We present the first multi-task neural network jointly predicting blocker activity for the three primary CiPA cardiac ion channels (hERG, Nav1.5, Cav1.2) within a single architecture. We introduce a reverse-leak audit methodology that reveals systematic test-set contamination in cross-publication cardiac safety benchmarks, establishing a stricter evaluation protocol.
We provide the empirical test of predicted L1000 transcriptomic features as auxiliary input for cardiac ion channel prediction and document a well-characterized negative result. Graphical abstract: CardioSafe encodes each query SMILES with three branches (chemical fingerprints + descriptors, pretrained ChemBERTa, and predicted L1000 transcriptomic signatures), fuses them via a cross-attention block with four learnable per-channel query tokens, and emits binary blocker calls plus pChEMBL regression for hERG, Nav1.5, Cav1.2, and (exploratory) IKs.
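
A leak audit of the kind described above can be sketched with Tanimoto similarity over fingerprint bit sets. This is an assumption-laden toy: the abstract does not detail the audit procedure, the helper names `tanimoto` and `leaked` are hypothetical, and a real audit would more likely compare canonical structure identifiers as well as fingerprints.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def leaked(test_fps, train_fps, threshold=1.0):
    """IDs of test compounds whose best match in the training set is >= threshold.

    threshold=1.0 flags identical fingerprints (an assumed proxy for an exact
    compound match); lower values flag near-duplicates too.
    """
    flagged = []
    for cid, fp in test_fps.items():
        if any(tanimoto(fp, t) >= threshold for t in train_fps.values()):
            flagged.append(cid)
    return flagged

# Toy fingerprints: each compound is the set of its on-bit positions.
train = {"t1": {1, 2, 3, 4}, "t2": {10, 11}}
test = {"a": {1, 2, 3, 4}, "b": {1, 2, 3, 9}, "c": {20, 21}}

exact = leaked(test, train)                 # compound "a" duplicates "t1"
near = leaked(test, train, threshold=0.5)   # "b" shares 3 of 5 bits with "t1"
```

Removing the flagged compounds from the test set before scoring is the step that, per the abstract, changed the Nav1.5 and Cav1.2 comparison.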

6

Disentangling the contribution of disease genes to drug therapeutic and side effects

Lalagkas, P. N.; Melamed, R. D.

2026-05-05 genetic and genomic medicine 10.64898/2026.05.04.26352378 medRxiv

Top 0.1%

2.1%

Most clinical trials fail due to either lack of efficacy or safety concerns. Human genetics can address both failure reasons: disease-associated genes are not only promising therapeutic targets but also predict drug side effects. However, because the same genetic signal underlies both outcomes, we need methods that disentangle which disease genes mediate therapeutic benefit versus adverse side effects. We use DraphNet, our previously developed model that maps drug molecular effects onto disease genes to generate two gene sets per drug: one linked to its therapeutic effects (IND genes) and one linked to its side effects (SE genes). We show that IND and SE genes overlap for 76% of the tested drugs (compared to a null model). We also show that drugs sharing greater IND similarity also have greater SE similarity (ρ=0.57, p<1e-300). To show how our approach enables insights into drug biology, we construct groupings of drugs based on their IND and SE genes. We find that drugs in the same IND grouping are enriched for co-occurrence in the same SE grouping (OR=212.37). We present two examples to illustrate the kind of insights this network enables: identification of drugs with shared IND but distinct SE genes as repurposing candidates, and identification of drugs with shared SE but distinct IND genes to assist treatment selection in patients with comorbidities. Finally, we develop a neural network that directly links drug molecular effects onto disease genes and learns a gene-level score that quantifies each gene's relative contribution to drug therapeutic versus side effects on disease.

7

A control-validated pan-proteome deep-learning pipeline nominates GPR35 as a candidate target of the orphan bacterial metabolite ligiamycin A

Martin, J.

2026-07-06 bioinformatics 10.64898/2026.07.01.735807 medRxiv

Top 0.1%

1.8%

Most microbial natural products with documented bioactivity lack an identified molecular target, which limits their development. We present an open, control-validated computational pipeline for natural-product target hypothesis generation. It combines a pan-proteome deep-learning drug-target interaction (DTI) model (a graph neural-network ligand encoder, an ESM-2 protein language-model encoder, and bidirectional cross-attention) with bias-corrected ranking and control-anchored molecular docking. Applying it to ligiamycin A, a 2022-described Streptomyces/Achromobacter co-culture decalin-amino-maleimide with no reported target, we find that the predicted interactions of the compound are dominated by class-A G-protein-coupled receptors. Using a drug with a known target (losartan) we identify and correct a frequent-hitter bias in the raw model; after correction the standout candidates are uniformly class-A GPCRs, led by the orphan receptor GPR35. Structure-based docking with matched positive and negative controls across three candidates corroborates GPR35 specifically: ligiamycin A scores comparably to the known GPR35 agonist zaprinast at the agonist pocket (-8.1 vs -8.3 kcal/mol; non-binder floor -5.5), whereas FFAR1 is excluded and histamine H2 is inconclusive. We propose GPR35 as a prioritized, experimentally testable target and release the workflow as a reusable tool. The result is a computational hypothesis that requires experimental validation.

8

Practical Use of Advanced AI Frameworks on Real-Life Scientific Problems: Three Case Studies

Gulluoglu, H. S. A.; Baby, J.; Bagul, K. M.; Basangari, B. R.; Bathini, S. A.; Chalamalla, N. K. R.; Dcunha, J.; Gupta, O.; Huang, L.; Jiang, X.; Naidu, Y. R.; Sathishkumar, G.; Sehrawat, M.; Thota, S. L.; Thuvara, D.; Vanguri, M. B.; Yin, J.; Jugder, B.-E.; Lusky, I. E.; Li, J.; Sinitskiy, A.

2026-06-29 bioinformatics 10.64898/2026.06.23.734132 medRxiv

Top 0.1%

1.7%

Agentic artificial intelligence (AI) systems increasingly claim to automate scientific research, yet independent evaluations report persistent gaps between those claims and demonstrated capability. We tested frontier agentic AI systems on three practical problems: prediction of treatment non-response in immune-mediated inflammatory diseases, optical chemical structure recognition for literature mining, and prediction of drug-design-related properties from small datasets. Each problem was first assigned to autonomous frameworks and then reattempted as human-led, AI-assisted work. Autonomous runs failed in most cases, while human-led work produced reusable resources and modest but defensible performance, including new evidence for possible mechanisms of treatment resistance and a more practical benchmark for mining chemical structures from scientific papers. Property prediction was the single task on which one autonomous AI framework matched the human expert. We conclude that current frameworks can carry out engineering and analysis once a human expert leads the project, but cannot yet engineer a novel solution without oversight. The use of AI on real-life scientific problems remains an art rather than a routine technology.

9

From GWAS to drug: A framework for drug candidate prioritisation using a gene expression signature matching approach

Chauquet, S.; Jiang, J.-C.; Barker, L. F.; Hunter, Z. L.; Singh, G.; Wray, N. R.; McRae, A. F.; Shah, S.

2026-04-24 genetic and genomic medicine 10.64898/2026.04.22.26349470 medRxiv

Top 0.1%

1.7%

Drug targets supported by human genetic evidence have significantly higher approval rates, making genome-wide association studies a valuable resource for drug candidate prioritisation. Transcriptome-wide association study signature-matching is an emerging in silico approach that integrates GWAS data with expression quantitative trait loci to generate a disease gene expression signature, which is then compared against drug perturbation databases such as the Connectivity Map. Despite recent adoption, there is no consensus on optimal methodology. Here, we systematically benchmark key parameters, including TWAS method, eQTL tissue model, similarity metric, gene set size, and CMap cell line, using LDL cholesterol, familial combined hyperlipidemia, and asthma as proof-of-concept traits. We demonstrate that while TWAS signature-matching can successfully prioritise known first-line treatments, performance is highly sensitive to parameter choice; for instance, the selection of the cell line used for drug signatures alone can dramatically alter drug prioritisation. Based on these findings, we propose a best-practice framework for robust, genetically-informed drug prioritisation using TWAS signature-matching.
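
The signature-matching step can be illustrated with a small sketch that scores each drug by the rank correlation between its expression signature and the disease signature, then sorts so that the strongest predicted reversal comes first. Spearman correlation is used here only as one plausible similarity metric; the abstract benchmarks several and does not name a winner. All names and the toy data are hypothetical.

```python
import numpy as np

def rank(x):
    """Ranks 0..n-1 of a 1-D array (assumes no ties, to keep the sketch short)."""
    r = np.empty(len(x))
    r[np.argsort(x)] = np.arange(len(x))
    return r

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return float(np.corrcoef(rank(x), rank(y))[0, 1])

def prioritise(disease_sig, drug_sigs):
    """Sort drugs by ascending correlation with the disease signature.

    The most negative score is treated as the strongest reversal of the
    disease expression pattern (an assumed scoring convention).
    """
    scores = {d: spearman(disease_sig, s) for d, s in drug_sigs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1])

# Toy signatures over the same five genes.
disease = np.array([2.0, -1.0, 0.5, -3.0, 1.5])
drugs = {
    "drug_A": -disease,                                  # mirrors the disease
    "drug_B": 2.0 * disease,                             # mimics the disease
    "drug_C": np.array([0.1, 0.3, -0.2, 0.4, -0.5]),     # unrelated pattern
}
ranking = prioritise(disease, drugs)
```

Swapping `spearman` for another metric, or restricting the gene set, changes the ranking, which is the kind of parameter sensitivity the abstract reports.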

10

Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction

Guo, J.

2026-05-04 bioinformatics 10.64898/2026.04.29.721568 medRxiv

Top 0.1%

1.3%

The rapid growth of molecular foundation models and large language models has encouraged a scale-centred view of AI in drug discovery, in which larger pretrained models are expected to supersede compact cheminformatics models and graph neural networks (GNNs) trained for individual tasks. We test this assumption across 26 endpoints for molecular properties, toxicity, safety liabilities and biological activity, grouped into ADME, toxicity and bioactivity classes. The benchmark contains 78 endpoint-and-split entries spanning random, Murcko-scaffold and structure-separated 5-fold CV. Ordered from easiest to hardest, these splits approximate retrospective evaluation on a closed library, scaffold expansion in hit-to-lead, and library expansion on novel chemotypes. Each entry includes ML, GNN, pretrained molecular sequence and LLM-based SAR families. Across 156 fold-mean comparisons, classical ML such as RF(ECFP4) and ExtraTrees(RDKit) win 116, GNNs such as GIN and Ligandformer win 25, pretrained sequence models such as MoLFormer and ChemBERTa2 win 12, and LLM-based SAR baselines win three. ML dominates random-split interpolation but loses part of this advantage under harder splits; GNN and sequence models also decline but gain relative ground, whereas LLM-based SAR is weaker in absolute terms yet less sensitive to the split axis. Paired bootstrap analyses support family-level trends more strongly than individual model rankings. SAR knowledge derived from training folds improves many GPT5.5-SAR and Opus4.7-SAR metrics but does not make rule-based reasoning a universal substitute for supervised predictors. Compact specialized models remain highly effective for molecular property and activity prediction. Larger models add value for SAR interpretation and reasoning in low-data settings, but predictive performance depends on the fit among model, task and validation scenario, not on scale alone.

11

HTS-Oracle X: AI-Guided Prospective Discovery of Small Molecule Immune Checkpoint Binders

Abdel-Rahman, S.; Gabr, M.

2026-06-22 bioinformatics 10.64898/2026.06.17.732853 medRxiv

Top 0.1%

1.2%

Targeting immune checkpoint protein-protein interactions (PPIs) using small molecules remains limited by the shallow, featureless binding surfaces of co-stimulatory and co-inhibitory receptors and the characteristically low hit rates of conventional high-throughput screening against these interfaces. Here we report HTS-Oracle X, a multimodal deep learning platform that integrates bidirectional cross-attention fusion of ChemBERTa SMILES embeddings with extended RDKit descriptors, trains on continuous biophysical binding signals rather than binary labels, and employs Monte Carlo Dropout uncertainty quantification for uncertainty-adjusted compound selection. Trained on 45,760 Dianthus TRIC-screened compounds per target under scaffold-aware cross-validation, HTS-Oracle X was applied prospectively to a 100,160-compound Enamine library against CD28, TIM-3, and VISTA. From 150 model-selected compounds, 45 dose-response confirmed binders were identified (30.0% overall hit rate), yielding enrichment factors of 234-408x over experimentally established random prospective baselines and 16 sub-micromolar hits. The top hits, HX-CD28-1 (KD = 233 nM), HX-TIM3-1 (KD = 249 nM), and HX-VISTA-1 (KD = 345 nM), demonstrated on-target functional activity in immune cell and tumor co-culture assays. HTS-Oracle X represents a scalable AI-guided framework for small molecule discovery against non-enzymatic immune checkpoint targets.
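
Uncertainty-adjusted selection with Monte Carlo Dropout can be sketched as below. The stochastic predictions would normally come from repeated forward passes through a network with dropout left active; here they are supplied as a plain array. The scoring rule (mean minus a multiple of the standard deviation), the weight `lam`, and the function name are assumptions, since the abstract does not state the exact adjustment used by HTS-Oracle X.

```python
import numpy as np

def select_compounds(mc_preds, k, lam=1.0):
    """Pick the k compounds with the best uncertainty-penalised score.

    mc_preds: array of shape (passes, compounds), one row per stochastic
              forward pass (assumed to come from dropout at inference time).
    lam:      hypothetical weight on the uncertainty penalty.
    """
    mean = mc_preds.mean(axis=0)    # predicted binding signal
    std = mc_preds.std(axis=0)      # spread across passes = uncertainty
    score = mean - lam * std        # penalise uncertain predictions
    top = np.argsort(-score)[:k]    # indices of the k highest scores
    return top, mean, std

# Three passes over three compounds: compound 1 has a decent mean but
# disagrees strongly between passes; compounds 0 and 2 are stable.
mc_preds = np.array([
    [0.9, 0.8, 0.2],
    [0.9, 0.2, 0.2],
    [0.9, 1.4, 0.2],
])
top_mild, mean, std = select_compounds(mc_preds, k=2, lam=1.0)
top_strict, _, _ = select_compounds(mc_preds, k=2, lam=2.0)
```

With the stricter penalty the noisy compound 1 drops out of the top two in favour of the stable but weaker compound 2, which illustrates the trade-off that the penalty weight controls.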

12

Quantum Encoding Strategies for Drug Response Prediction: An Exhaustive Benchmark on a 20-Qubit Superconducting QPU

Derouich, R.; Mathlouthi, N. E. H.

2026-07-13 bioinformatics 10.64898/2026.07.08.737310 medRxiv

Top 0.1%

1.2%

We present the first systematic, hardware-executed benchmark of twelve distinct quantum data-encoding strategies for drug-response prediction on a real superconducting quantum processing unit (QPU). All experiments were conducted on the IQM Garnet 20-qubit QPU via the IQM Resonance cloud platform, using the Qrisp quantum-software framework (v 0.8.2). Each encoding was evaluated on n = 50 stratified samples drawn from the Genomics of Drug Sensitivity in Cancer dataset (GDSC2, 242 036 drug-cell-line pairs), targeting the natural-log IC50 response variable. Variational weights were optimised offline with the gradient-free COBYLA algorithm before hardware submission. Every circuit was executed with 1024 shots; the regression signal is the zero-qubit Pauli expectation value ⟨Z0⟩. Results show that the QAOA-inspired encoding achieves the best RMSE of 3.314 and is statistically superior (p < 0.05, Wilcoxon signed-rank test) to six of the remaining eleven encodings. Hardware-efficient entanglement structures -- specifically alternating cost and mixer layers -- provide a systematic advantage over purely rotational or diagonal encodings under realistic noise conditions. This work constitutes a reproducible baseline for noise-aware quantum machine learning on pharmaceutical data; all code, data, and raw QPU outputs are publicly released.
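
For readers unfamiliar with how a Pauli-Z expectation value is read out from shots, the following sketch estimates it on one qubit from a dictionary of measurement counts. It uses plain Python rather than the Qrisp API, and it assumes that qubit 0 is the leftmost character of each bitstring; frameworks differ on this ordering, so the index would need adjusting for real output.

```python
def expectation_z0(counts, qubit_index=0):
    """Estimate the Pauli-Z expectation value on one qubit from shot counts.

    counts: maps measured bitstrings to the number of shots that produced them.
    A measured '0' contributes +1 and a measured '1' contributes -1.
    qubit_index: position of the qubit within the bitstring (assumed leftmost).
    """
    shots = sum(counts.values())
    total = sum(
        (1 if bits[qubit_index] == "0" else -1) * n
        for bits, n in counts.items()
    )
    return total / shots

# Toy two-qubit result from 1024 shots (made-up numbers).
counts = {"00": 600, "01": 200, "10": 124, "11": 100}
z0 = expectation_z0(counts)
```

With a finite number of shots the estimate carries sampling noise on top of hardware noise, which is one reason the shot count is reported alongside the RMSE.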

13

Agent-Driven Validation of Oncology Therapeutic Targets

Huang, K.-l.; Accelerated Discovery with Agents (ADA) Consortium,

2026-05-03 genomics 10.64898/2026.04.29.721634 medRxiv

Top 0.1%

1.1%

Selecting the correct target is critical in drug development, yet systematic replication of published target claims is rarely performed. Here, we introduce a replication-focused AI agent framework to evaluate 31 gene target-disease hypotheses, including context-specific oncology targets from both retracted and non-retracted papers. Each target claim was translated into a zero-shot validation prompt executed by a biomedical research agent in one round, and all agent-driven analyses were validated and scored by a domain expert. Compared to retracted targets (2/17 validated, 11.8%), non-retracted targets (9/14 validated, 64.3%) were 17-fold more likely to show context-specific dependency in agent-driven analyses. The replicated targets include WRN in microsatellite stable cancer, PRMT5 in MTAP-deleted cancer, as well as more recent discoveries such as PTGES3, HASPIN, SLC5A3, PKMYT1, FAM126B, and PAPSS1. These results demonstrate that agent-human collaboration can conduct data-driven validation at scale, improve target prioritization, and systematically reduce translational risk for drug development.

14

On the applicability domain of HADDOCK3 for protein-aptamer docking: documented failure modes from a 5x7 cross-target screening matrix and a 1676 aa receptor case study (P01031)

Dohi, E.

2026-05-12 bioinformatics 10.64898/2026.05.11.724398 medRxiv

Top 0.2%

1.1%

We screened a 5 receptor x 7 aptamer = 35-cell cross-target matrix with HADDOCK3 [1] under blind ambiguous-interaction-restraint (AIR) protocols on AlphaFold-modelled receptors. The screen surfaced 12 operationally distinct failure modes (collapsing to ~8 conceptual classes; §3.1). The K_D-calibration subset is n = 4 cells with literature K_D records under matched assay conditions; the broader cohort includes ≥ 6 biological cognate or intended-cognate cells. The principal case study is P01031 (complement C5, 1676 aa, ≥ 12 structural domains): all 7 panel members produced positive HADDOCK3 top-1 scores under a scale-adaptive AIR. Score-term decomposition locates the anomaly in the AIR term (+217 to +268 to top-1 score). With AIR zeroed, scores fall to -131 to -74 -- the small-receptor regime. Boltz-2 cofolding chain-pair ipTM (cpi_AB) is an independent channel: P01031 shows the lowest median cpi_AB (0.211; 0/7 above the 0.5 confident-interface threshold). To our knowledge, this is the first reported case study of a 1676 aa multi-domain receptor exhibiting this signature under blind scale-adaptive AIR -- an n = 1 mechanistic case, not a statistical generalisation. We adapt the QSAR applicability domain concept [14-16] to in silico aptamer screening. §3.7 reports an empirical Mode 1 mitigation (pLDDT-aware AIR prefilter; cohort Jaccard recovery ~10x).

15

preSCRIPT: Large-scale prescription search and annotation engine for pharmacogenomic studies

Pieczarka, M.; Pienkowski, P.; Konowalska, P.; Grubarek, S.; Hajto, J.; Hoinkis, D.; Piechota, M.; Borczyk, M.; Korostynski, M.

2026-04-29 genetic and genomic medicine 10.64898/2026.04.28.26351989 medRxiv

Top 0.2%

1.1%

Pharmacogenetics (PGx) has traditionally focused on a small number of high-impact variants affecting drug response because PGx studies are labor-intensive and therefore low-throughput. Population biobanks linked to electronic health records (EHRs), including the UK Biobank (UKB) with prescription data for ~230,000 individuals, offer opportunities to scale PGx research. This, however, comes with a challenge, as EHRs do not provide direct treatment response outcomes. One way to overcome this is to draw indirect drug response phenotypes from prescription records. Here, we propose preSCRIPT, a framework to filter and annotate raw prescriptions from the UKB to derive phenotypes for analyses, which includes an algorithm to distinguish short prescription gaps from true dose changes. As a proof of concept, we applied preSCRIPT to warfarin, paracetamol, codeine, amitriptyline, simvastatin, aspirin, and amlodipine and derived therapy length and median daily doses. We tested associations for those seven drugs and two phenotypes across SNPs, cytochrome P450 (CYP) genes, and HLA alleles. We replicated known associations such as CYP2D6 variants with amitriptyline therapy length and dose, CYP2C9/CYP4F2/CYP2C19 with warfarin dose, and CYP2D6 with codeine dose. For drugs without formal PGx guidelines, we identified an association between CYP2D6 activity and aspirin therapy length and several SNPs, including rs62471929 (CYP3A5), a variant for amlodipine dose, replicated in an independent hold-out set. Overall, our study shows that preSCRIPT can recover established PGx associations, prioritize exploratory novel candidate loci, and may serve as a tool for large-scale pharmacogenomics.

16

ADMET Property Prediction with Quantum-Inspired Preprocessing

Mansour, B.; Rafaelyan, G.

2026-07-05 bioinformatics 10.64898/2026.06.30.735582 medRxiv

Top 0.2%

1.1%

Accurate prediction of Absorption, Distribution, Metabolism, Excretion, and Toxicity (ADMET) properties is a central challenge in early-stage drug discovery, where experimental determination remains costly and time-consuming. In this work, we propose a quantum-inspired preprocessing framework in which statistical dependencies among molecular descriptors are encoded into a parameterised many-body Hamiltonian, and the expectation values obtained by simulating its time evolution serve as additional inputs to a gradient-boosted ensemble model (CatBoost). Mutual information (MI) is used both to select the most informative descriptors and to set the coupling strengths of the Hamiltonian, so that the induced entanglement structure reflects empirically measured feature correlations; the evolution is realised with a short digitised-counterdiabatic schedule that generates a compact set of expectation-value features while keeping the circuit shallow. The resulting quantum-derived feature vectors are concatenated with the full MapLight descriptor set, concatenated ECFP, Avalon, and ErG fingerprints together with RDKit physicochemical properties, before training. We evaluate the pipeline on the AqSolDB aqueous solubility benchmark from the Therapeutics Data Commons (TDC) platform, achieving a mean absolute error (MAE) of 0.746 +/- 0.006 log(mol/L), which is within the reported error bars of the current top-performing model on the TDC leaderboard (MAE = 0.741 +/- 0.013). Ablation experiments show that the quantum-derived features match classical second-degree polynomial interaction features derived from the same MI-selected subset, while forming a far more compact representation (85 quantum features versus up to 4,950 polynomial terms, an approximately 58-fold reduction). SHapley Additive exPlanations (SHAP) analysis identifies the physicochemical drivers of solubility predictions, offering interpretable insight into model behaviour. 
These results demonstrate that MI-guided Hamiltonian feature extraction can reproduce the performance of strong classical interaction models on aqueous solubility while generating a compact, interpretable feature representation that is compatible with future quantum execution.

17

VFUSE: Virulent Feature Understanding with Sparse autoEncoders

Olson, M. L.; Yu, M.

2026-06-11 bioinformatics 10.64898/2026.06.08.730928 medRxiv

Top 0.2%

1.1%

Generative models have shown remarkable progress in a variety of domains such as protein design, but such power enables the opaque generation of hazardous proteins. In this work, we introduce VFUSE (Virulent Feature Understanding with Sparse autoEncoders), a mechanistic interpretability approach that trains SAEs on diffusion-transformer activations to audit protein models for hazard-aware features. We apply VFUSE to RoseTTAFold3 and RFDiffusion3, popular open-weight models for protein folding and synthesis. We find that for certain blocks, linear probes detect hazardous designs significantly better when fit in the SAE latent space over the original model's representations: improving interpretability without sacrificing model performance. Furthermore, we identify monosemantic features from the SAE that fire only on hazardous designs at up to AUROC 0.84 (q < 10^-13). To our knowledge this is the first SAE trained on an all-atom diffusion model and the first feature-level virulence audit of a protein design model, paving the way towards safe and interpretable protein design.

18

An Integrated Knowledge Graph and Network Medicine Pipeline for Drug Repurposing: Benchmarking Across Human Diseases and Application to Amyotrophic Lateral Sclerosis

Jiang, A.; Hu, J.; Abdulle, Y.; Pain, O.; Iacoangeli, A.

2026-07-08 bioinformatics 10.64898/2026.07.03.736387 medRxiv

Top 0.2%

1.1%

Drug repurposing offers a practical strategy to identify new therapeutic uses for approved drugs, potentially reducing the time and cost associated with conventional drug development. We present a novel three-stage drug repurposing pipeline that integrates knowledge graph-based gene prediction, network-based drug-disease association analysis, and systematic classification of candidate drugs by therapeutic class. The pipeline integrates DGLinker to predict novel disease-associated genes, SAveRUNNER to identify drug repurposing candidates, and ATC Category Enrichment Analysis (ATCEA) to prioritise candidates by pharmacological class. We benchmarked the pipeline across twelve diseases using DrugBank and MEDI2-HPS as validation resources. Utilising DGLinker-expanded disease-gene sets as input increased the number of predicted repurposed drugs, while overall discriminative performance remained stable across diseases (AUROC 0.71-0.77). Application of ATCEA consistently improved precision, F1-score, and specificity, while reducing recall, reflecting a conservative prioritisation strategy that contracts the candidate space while retaining pharmacologically coherent drug-disease candidates. We further applied the pipeline to amyotrophic lateral sclerosis (ALS), a neurodegenerative disease with limited therapeutic options, and performed a deeper literature-based validation of the results. Incorporation of DGLinker-predicted genes substantially increased the number of significant candidate drugs and uncovered enriched ATC categories not identified using known ALS genes alone, including antidepressants and antipsychotics. Moreover, several drugs with supporting evidence available in the literature were identified only when DGLinker-predicted genes were used. Overall, 77 candidate drugs were prioritised within significantly enriched ATC categories, several of which are supported by previously published studies. 
To provide exploratory real-world support for these findings, we further evaluated candidate drugs in a longitudinal electronic health record (EHR) dataset of 2361 patients with ALS from King's College Hospital. Although the number of evaluable drugs was limited due to sample size, the EHR analysis provided additional clinically relevant context for selected prioritised drugs and pharmacological classes. Our pipeline demonstrates potential to accelerate drug repurposing by integrating complementary computational approaches to each step of the process, providing an end-to-end framework that showed robust performance across benchmarking experiments and use cases.
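
The precision/recall trade-off reported for ATCEA can be illustrated with a small, purely hypothetical example. The drug identifiers, the reference set, and the subset retained after filtering are all invented for illustration and are not taken from the benchmark. The sketch only shows why contracting a candidate list to its best-supported members tends to raise precision, F1, and specificity while lowering recall.

```python
def confusion_metrics(universe, predicted, known):
    """Precision, recall, specificity and F1 for a predicted drug set.

    universe:  all drugs screened (hypothetical identifiers)
    predicted: drugs flagged as repurposing candidates
    known:     drugs with a validated indication (reference set)
    """
    tp = len(predicted & known)
    fp = len(predicted - known)
    fn = len(known - predicted)
    tn = len(universe - predicted - known)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}


# Hypothetical toy data: 20 screened drugs, 5 of them validated.
universe = {f"drug_{i:02d}" for i in range(20)}
known = {f"drug_{i:02d}" for i in range(5)}

# Candidate list before any class-based filtering (assumed).
unfiltered = {f"drug_{i:02d}" for i in (0, 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12)}
# Candidates kept after a conservative filter (assumed).
filtered = {f"drug_{i:02d}" for i in (0, 1, 2, 5)}

before = confusion_metrics(universe, unfiltered, known)
after = confusion_metrics(universe, filtered, known)

for name in ("precision", "recall", "specificity", "f1"):
    print(f"{name:12s} before={before[name]:.2f} after={after[name]:.2f}")
```

In this toy case the filter drops one true positive and seven false positives, so recall falls while the other three metrics rise, which is the same direction of change the abstract describes.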

19

BoltzProt-1: Towards Efficient De Novo Binder Design with Good Developability

Ucar, T.; Bates, J.; Fu, Y.; Shi, W.; Stark, H.; Nava, D.; Cavalleri, L.; Wohlwend, J.; Corso, G.; Passaro, S.

2026-06-27 bioinformatics 10.64898/2026.06.23.733997 medRxiv

Top 0.2%

1.1%

Designing binders against novel protein targets remains a central challenge in computational drug discovery. Here we introduce BoltzProt-1, a pipeline for generating protein binders, including nanobodies, with improved hit rates and favorable developability properties. At its core lie a refined iteration of BoltzGen's generative model and a novel protein-protein interaction prediction model, BoltzPPI. Employing BoltzPPI instead of BoltzGen's standard structure-prediction confidence metrics to rank nanobody (VHH) designs increases the confirmed-binder hit rate from 3.3% to 8.0% across 10 novel targets. Assessed on 10 additional targets used in prior literature, the BoltzProt-1 pipeline obtains nanobody screening hits for 7 of 10 targets, surpassing the 6 of 10 previously reported by Chai-2. Finally, evaluating the developability of BoltzProt-1-designed nanobodies in terms of stability, aggregation, purity, polyspecificity and hydrophobicity reveals that 58% of its confirmed binders pass every criterion, exceeding both BoltzGen (40%) and clinical-stage VHH controls (21%).
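
How the choice of ranking score affects hit rate can be sketched as follows: designs are sorted by a score, the top k are selected for experimental testing, and the hit rate is the fraction of those that bind. All design names, scores, and binding labels below are hypothetical. The two score fields merely stand in for a structure-prediction confidence metric and an interaction-prediction score, and neither reproduces BoltzGen or BoltzPPI.

```python
def hit_rate_at_k(designs, score_key, k):
    """Fraction of confirmed binders among the k top-scoring designs."""
    ranked = sorted(designs, key=lambda d: d[score_key], reverse=True)
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for d in top if d["binds"]) / len(top)


# Hypothetical designs; "confidence" and "ppi_score" are illustrative
# stand-ins for two alternative ranking metrics.
designs = [
    {"name": "vhh_01", "confidence": 0.95, "ppi_score": 0.30, "binds": False},
    {"name": "vhh_02", "confidence": 0.90, "ppi_score": 0.85, "binds": True},
    {"name": "vhh_03", "confidence": 0.88, "ppi_score": 0.20, "binds": False},
    {"name": "vhh_04", "confidence": 0.80, "ppi_score": 0.35, "binds": False},
    {"name": "vhh_05", "confidence": 0.70, "ppi_score": 0.90, "binds": True},
    {"name": "vhh_06", "confidence": 0.65, "ppi_score": 0.75, "binds": True},
    {"name": "vhh_07", "confidence": 0.60, "ppi_score": 0.10, "binds": False},
    {"name": "vhh_08", "confidence": 0.50, "ppi_score": 0.60, "binds": False},
]

by_confidence = hit_rate_at_k(designs, "confidence", k=4)
by_ppi = hit_rate_at_k(designs, "ppi_score", k=4)
print(f"hit rate @4 ranked by confidence: {by_confidence:.2f}")
print(f"hit rate @4 ranked by ppi_score:  {by_ppi:.2f}")
```

The same designs yield different hit rates purely because the testing budget is spent on a different top k. For scale, the reported change from 3.3% to 8.0% corresponds to roughly a 2.4-fold increase.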

20

Sanjeevani: A manually curated anti-cancerous phytochemical database integrated with downstream analysis tools.

Jha, V.; Jha, R.; Shukla, S.; Shingan, S.; Das, G.

2026-06-19 bioinformatics 10.64898/2026.06.15.732344 medRxiv

Top 0.2%

1.1%

Background: Cancer continues to pose a massive global health burden. While plant-derived phytochemicals offer promising therapeutic leads, existing natural product databases often lack cancer specificity, dataset downloadability, and integrated screening tools. Methods: We developed Sanjeevani, an integrative web platform cataloguing 4,823 curated anticancer phytochemicals. Using a balanced dataset of 9,646 molecules, we trained Support Vector Machine (SVM), Random Forest, and K-Nearest Neighbours classifiers on a hybrid feature representation of RDKit descriptors and 2048-bit ECFP4 fingerprints. The platform also integrates AutoDock Vina for web-based molecular docking (binding affinity and pose prediction) and ADMET-AI for pharmacokinetics estimation. Results: The SVM model demonstrated the strongest predictive capability, achieving a top test accuracy of 0.966 and a ROC-AUC of 0.992. Benchmarking across five docking tools confirmed that AutoDock Vina successfully balanced computational automation with literature-consistent binding affinity replication. The final architecture provides rapid interactive 2D/3D visualizations integrated with downstream analysis tools. Conclusion: Sanjeevani provides an open-access, one-stop pipeline that bridges the gap between raw natural product data and actionable computational screening, accelerating natural product-based oncology drug discovery.
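
The two headline metrics measure different things: accuracy depends on a decision threshold, whereas ROC-AUC is the probability that a randomly chosen positive is scored above a randomly chosen negative. A minimal pure-Python sketch follows. The labels and scores are made up for illustration, and the 0.5 threshold is an assumption rather than a detail taken from the Sanjeevani models.

```python
def accuracy(labels, scores, threshold=0.5):
    """Share of correct calls when scores >= threshold are called positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison of positive and negative scores.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier output: 1 = anticancer, 0 = not.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

acc = accuracy(labels, scores)
auc = roc_auc(labels, scores)
print(f"accuracy={acc:.4f} roc_auc={auc:.4f}")
```

Here one positive falls below the threshold and one negative above it, so accuracy is lower than ROC-AUC, which only penalises the single mis-ordered positive/negative pair.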